Diagnosing Diseases using kNN

An application of kNN to diagnose Diabetes

Author

Jacqueline Razo & Elena Boiko (Advisor: Dr. Cohen)

Published

March 11, 2025

Slides: slides.html

Introduction

The k-Nearest Neighbors (kNN) algorithm is used in a variety of fields to classify or predict data. It is a simple algorithm that classifies a datapoint based on how similar it is to a class of datapoints. Its main benefits are that it is simple to use and that it is non-parametric, which means it makes no assumption about the underlying data distribution and therefore fits a wide variety of datasets. One drawback is that it has a higher computational cost than other models, so it does not perform as well or as fast on big data. Despite this, the model’s simplicity makes it easy to understand and to implement in a variety of fields. One such field is healthcare, where kNN models have been successfully used to predict diseases such as diabetes and hypertension. In this paper we focus on the methodology and application of kNN models in healthcare to predict diabetes, a pressing public health problem.

Literature Review

The literature review explores the theoretical background of kNN and the key factors affecting its performance, recent advancements in optimizing kNN for large datasets, and the role of kNN in medical diagnosis, particularly diabetes prediction.

Theoretical Background of kNN

kNNs are supervised learning algorithms that label a data point by comparing it to other, similar data points. They rely on the assumption that data points that are similar to each other must be close to each other. In (Z. Zhang 2016), the author introduces how kNN works and how to run a kNN model in RStudio. He describes the methodology as assigning an unlabeled observation to a class by using labeled examples that are similar to it. He also describes the Euclidean distance equation, which is the default distance used for kNNs, and the impact of the k parameter, which tells the model how many neighbors to use when classifying a data point. Zhang recommends setting k equal to the square root of the number of observations in the training dataset.

Although Zhang’s recommendation could be a good starting point for setting k, (S. Zhang et al. 2017) proposed decision tree-assisted tuning to optimize k, significantly enhancing accuracy. The authors propose a training stage in which a decision tree is used to select the ideal k value, making the kNN more efficient. They deployed and tested two more efficient kNN methods, called kTree and k*Tree, and found that these methods reduced running costs and increased classification accuracy.

Another big impact on accuracy is the distance the model uses to classify neighbors. Although the Euclidean distance is the default in kNNs, other distances can be used. In (Kataria and Singh 2013), the authors compare different distances in classification algorithms with a focus on the kNN algorithm. The paper starts by explaining how the kNN algorithm uses the k nearest neighbors to classify data points, then describes how the Euclidean distance does this by placing a line segment between point a and point b and measuring its length with the Euclidean distance formula. It goes on to describe the “cityblock” or taxicab distance as “the sum of the length of the projections of the line segment”. It also describes the cosine and correlation distances, and then compares the performance of the default Euclidean distance with that of the city block, cosine, and correlation distances. In the end, the authors found that the Euclidean distance was more efficient than the others in their observations.
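The four distances compared in that study can be written in a few lines each. The sketch below is only illustrative: the function names and the two sample vectors are our own assumptions, not code or data from Kataria and Singh, and the cosine and correlation distances are taken as one minus the corresponding similarity, which is the usual convention.

```python
import math

def euclidean(a, b):
    # Length of the straight line segment between a and b
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cityblock(a, b):
    # Sum of the absolute coordinate differences (taxicab distance)
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # One minus the cosine of the angle between a and b
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def correlation_distance(a, b):
    # Cosine distance after centering each vector on its own mean
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_distance([x - mean_a for x in a], [y - mean_b for y in b])

# Hypothetical example vectors
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 7.0]

for name, fn in [("euclidean", euclidean), ("cityblock", cityblock),
                 ("cosine", cosine_distance), ("correlation", correlation_distance)]:
    print(f"{name:12s} {fn(a, b):.4f}")
```

The Euclidean and city block distances depend on the scale of the features, while the cosine and correlation distances depend only on direction, which is one reason the best metric can differ between datasets.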

Syriopoulos et al. (Syriopoulos et al. 2023) also reviewed distance metric selection, confirming that Euclidean distance remains the most effective choice for most datasets. However, alternative metrics like Mahalanobis distance can perform better for correlated features. The review emphasized that selecting the right metric is dataset-dependent, influencing classification accuracy.

Challenges in Scaling kNN for Large Datasets

While kNN is simple and effective, it struggles with computational inefficiency when working with large datasets since it must calculate distances for every new observation. This becomes a major challenge in big data, where the sheer volume of information makes traditional kNN methods slow and resource-intensive.

To address this, Deng et al. (Deng et al. 2016) proposed an improved approach called LC-kNN, which combines k-means clustering with kNN to speed up computations and enhance accuracy. By dividing large datasets into smaller clusters, their method reduces the number of distance calculations needed. After extensive testing, the authors found that LC-kNN consistently outperformed standard kNN, achieving higher accuracy and better efficiency. Their study highlights a key limitation of traditional kNN (without optimization, its performance significantly declines on big data) and offers an effective solution to improve its scalability.

Continuing and summarizing these ideas, Syriopoulos et al. (Syriopoulos et al. 2023) explored techniques for accelerating kNN computations, such as:

  • Dimensionality reduction (e.g., PCA, feature selection) to reduce data complexity.
  • Approximate Nearest Neighbor (ANN) methods to speed up distance calculations.
  • Hybrid models combining kNN with clustering (e.g., LC-kNN) to improve efficiency.

These approaches enhanced both speed and accuracy, making them promising solutions for handling large datasets. In addition, the study categorizes kNN modifications into local hyperplane methods, fuzzy-based models, weighting schemes, and hybrid approaches, demonstrating how these adaptations help tackle issues like class imbalance, computational inefficiency, and sensitivity to noise.

Another key challenge for kNN is its performance in high-dimensional datasets. The 2023 study by Syriopoulos et al. evaluates multiple nearest neighbor search algorithms such as kd-trees, ball trees, Locality-Sensitive Hashing (LSH), and graph-based search methods that enable kNN performance scaling for larger datasets through minimized distance calculations.

The enhancements to kNN have substantially increased its performance in terms of speed and accuracy which now allows it to better handle large-scale datasets. However, as Syriopoulos et al. primarily compile prior research rather than conducting empirical comparisons, further work is needed to evaluate these optimizations in real-world medical classification tasks.

kNN in Disease Prediction: Applications & Limitations

Disease Prediction with kNN

kNN has been widely used for diabetes classification and early detection. Ali et al. (Ali et al. 2020) tested six different kNN variants in MATLAB to classify blood glucose levels, finding that fine kNN was the most accurate. Their research highlights how optimizing kNN can improve classification performance, making it a valuable tool in healthcare.

In turn, Saxena et al. (Saxena, Khan, and Singh 2014) used kNN on a diabetes dataset and observed that increasing the number of neighbors (k) led to better accuracy, but only to a certain extent. In their MATLAB-based study, they found that using k = 3 resulted in 70% accuracy, while increasing k to 5 improved it to 75%. Both studies demonstrate how kNN can effectively classify diabetes, with accuracy depending on the choice of k and dataset characteristics. Ongoing research continues to refine kNN, making it a more efficient and reliable tool for medical applications.

Feature selection is another critical factor. Panwar et al. (Panwar et al. 2016) demonstrated that focusing on just BMI and Diabetes Pedigree Function improved accuracy, suggesting that a simpler feature set can enhance model performance. Suriya and Muthu (Suriya and Muthu 2023) showed that kNN is a promising model for predicting type 2 diabetes, with the highest accuracy on smaller datasets. The authors tested the kNN algorithm on three datasets ranging from 692 to 1853 rows and from 9 to 22 dimensions and found that the larger dataset required a higher k value. In addition, using PCA to reduce dimensionality did not improve model performance, which suggests that simplifying the data does not always lead to better results in diabetes prediction. Iparraguirre-Villanueva et al. (Iparraguirre-Villanueva et al. 2023) reported the same finding about the effect of PCA on ML models, and on kNN in particular. They also confirmed that kNN alone is not always the best choice. The authors compared kNN with Logistic Regression, Naïve Bayes, and Decision Trees. Their results showed that while kNN performed well on balanced datasets, it struggled when class imbalances existed. While PCA significantly reduced accuracy for all models, the SMOTE-preprocessed dataset gave the highest accuracy for the kNN model (79.6%), followed by BNB with 77.2%. This shows the importance of choosing appropriate preprocessing techniques to improve kNN accuracy, especially when handling imbalanced datasets.

Khateeb & Usman (Khateeb and Usman 2017) extended kNN’s application to heart disease prediction, demonstrating that feature selection and data balancing techniques significantly impact accuracy. Their study showed that removing irrelevant features did not always improve performance, emphasizing the need for careful feature engineering in medical datasets.

kNN Beyond Prediction: Handling Missing Data

While kNN is widely known for classification, it also plays a key role in data preprocessing for medical machine learning. Altamimi et al. (Altamimi et al. 2024) explored kNN imputation as a method to handle missing values in medical datasets. Their study showed that applying kNN imputation before training a machine learning model significantly improved diabetes prediction accuracy - from 81.13% to 98.59%. This suggests that kNN is not only useful for disease classification but also for improving data quality and completeness in healthcare applications.

Traditional methods often discard incomplete records, but kNN imputation preserves valuable information, leading to more reliable model performance. However, Altamimi et al. (2024) also highlighted challenges such as computational costs and sensitivity to parameter selection, reinforcing the need for further optimization when applying kNN to large-scale medical datasets.

Comparing kNN Variants & Hybrid Approaches

Research indicates that kNN works well for diabetes prediction, but recent studies demonstrate that it does not consistently provide the best results. Theerthagiri et al. (Theerthagiri, Ruby, and Vidya 2022) evaluated kNN against multiple machine learning models, such as Naïve Bayes, Decision Trees, Extra Trees, Radial Basis Function (RBF), and Multi-Layer Perceptron (MLP), using the Pima Indians Diabetes dataset. kNN performed adequately, but MLP outperformed all other algorithms, achieving the top accuracy of 80.68% and the highest AUC-ROC at 86%. Despite its effectiveness in classification tasks, kNN’s primary limitation is its inability to compete with advanced models like neural networks when processing complex datasets.

In turn, Uddin et al. (Uddin et al. 2022) explored advanced kNN variants, including Weighted kNN, Distance-Weighted kNN, and Ensemble kNN. Their findings suggest that:

  • Weighted kNN improved classification by assigning greater importance to closer neighbors.
  • Ensemble kNN outperformed standard kNN in disease prediction but required additional computational resources.
  • Performance was highly sensitive to the choice of distance metric and k value tuning.

Their findings suggest that kNN can be improved through modifications, but it remains highly sensitive to dataset size, feature selection, and distance metric choices. In large-scale healthcare applications, Decision Trees (DT) and ensemble models may offer better trade-offs between accuracy and efficiency. These studies highlight the ongoing debate over kNN’s role in medical classification - whether modifying kNN is the best approach or if other models, such as DT or ensemble learning, provide stronger performance for diagnosing diseases.

kNN continues to be a valuable tool in medical machine learning, offering simplicity and strong performance in classification tasks. However, as research shows, its effectiveness depends on proper feature selection, optimized k values, and preprocessing techniques like imputation. While kNN remains an interpretable and adaptable model, newer methods - such as ensemble learning and neural networks - often outperform it, particularly in large-scale datasets. For our capstone project, exploring feature selection, fine-tuning kNN’s settings, and comparing it to other algorithms could give us valuable insights into its strengths and limitations.

Methods

The kNN algorithm is a nonparametric supervised learning algorithm that can be used for classification or regression problems (Syriopoulos et al. 2023). In classification, it works on the assumption that similar data points are close to each other in distance. It classifies a datapoint by using the Euclidean distance formula to find its k nearest data points. Once these k data points have been found, the kNN assigns the new datapoint the category held by the majority of them (Z. Zhang 2016). Figure 1 illustrates this methodology with two distinct classes, hearts and circles. The kNN algorithm is attempting to classify the mystery figure represented by the red square. The k parameter is set to k=5, which means the algorithm will use the Euclidean distance formula to find the 5 nearest neighbors, enclosed by the green circle. From here the algorithm simply counts the number of neighbors from each class and assigns the class that represents the majority, which in this case is a heart.

Figure 1

Classification process

The classification process has three distinct steps:

1. Distance calculation

The kNN first measures the distance between the datapoint it is trying to classify and all the training data points. Different distance calculation methods can be used, but the default and most commonly used method with the kNN is the Euclidean distance formula (Theerthagiri, Ruby, and Vidya 2022):

\[ d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \]

Code
# Base R graphics only; no additional packages are needed for this figure

#Add points (X1, Y1) and (X2, Y2)
X1 <- 10; Y1 <- 12
X2 <- 14; Y2 <- 16

#creates a plot
plot(c(X1, X2), c(Y1, Y2), type = "n", xlab = "X-axis", ylab = "Y-axis", main = "Figure 2: Euclidean Distance",xlim = c(X1 - 4, X2 + 4), ylim = c(Y1 - 4, Y2 + 4))

#Plot first point
points(X1, Y1, col = "red", pch = 16, cex = 2) 

#Plot second point
points(X2, Y2, col = "blue", pch = 16, cex = 2)

#Add horizontal line
segments(X1, Y1, X2, Y1, col = "green", lwd = 2)

#Add vertical line 
segments(X2, Y1, X2, Y2, col = "green", lwd = 2)

#Add hypotenuse line
segments(X1, Y1, X2, Y2, col = "purple", lwd = 2, lty = 2)

#Add labels
text(X1, Y1, labels = paste("(X1, Y1)\n(", X1, ",", Y1, ")"), pos = 2, col = "red", cex = 0.7) 
text(X2, Y2, labels = paste("(X2, Y2)\n(", X2, ",", Y2, ")"), pos = 4, col = "blue", cex = 0.7)
text((X1 + X2) / 2 -2, (Y1 + Y2) / 2 + 3, "Euclidean Distance (d)", col = "purple", font = 2, cex = 1.2)
arrows((X1 + X2) / 2, (Y1 + Y2) / 2 + 2,(X1 + X2) / 2, (Y1 + Y2) / 2,col = "purple", lwd = 2, length = 0.1)

#insert formula
text(mean(c(X1, X2)), mean(c(Y1, Y2)) -5, 
     labels = expression(d == sqrt((14 - 10)^2 + (16 - 12)^2)), 
     col = "black", cex = 0.9, font = 1)


Figure 2 illustrates the Euclidean distance formula, where \(X_2 - X_1\) is the horizontal difference and \(Y_2 - Y_1\) is the vertical difference. Each difference is squared so that it contributes a positive amount regardless of its direction; squaring also gives greater emphasis to larger differences. The squared differences are then summed, and the square root of the sum gives the distance d.
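As a quick check of the arithmetic in Figure 2, the same calculation can be reproduced for the two plotted points (10, 12) and (14, 16). This is a minimal sketch; the variable names are ours, and the standard library function `math.dist` is used only to confirm the hand-written formula.

```python
import math

# The two points plotted in Figure 2
x1, y1 = 10, 12
x2, y2 = 14, 16

dx = x2 - x1  # horizontal difference
dy = y2 - y1  # vertical difference

# Square each difference, sum them, and take the square root
d = math.sqrt(dx ** 2 + dy ** 2)
print(f"d = sqrt({dx}^2 + {dy}^2) = {d:.4f}")  # d = sqrt(4^2 + 4^2) = 5.6569

# Cross-check against the built-in Euclidean distance (Python 3.8+)
assert math.isclose(d, math.dist((x1, y1), (x2, y2)))
```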

2. Neighbor Selection

The kNN lets the user choose a parameter k, the number of neighbors used to classify the unknown datapoint. The choice of k is very important: a k that is too large lets the majority of the samples dominate the vote, which biases the classification and causes underfitting (Mucherino et al. 2009). A k that is too small makes the algorithm too sensitive to noise and outliers, which can cause overfitting. Studies recommend using cross-validation or heuristic methods, such as setting k to the square root of the dataset size, to determine an optimal value (Syriopoulos et al. 2023).
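The square-root heuristic is easy to compute and can serve as a starting point before cross-validation. The sketch below makes two assumptions that do not come from the cited studies: the helper name `sqrt_heuristic_k` is hypothetical, and the result is nudged to an odd number to reduce the chance of tied votes in a two-class problem. The 80% training share is also only an illustrative choice.

```python
import math

def sqrt_heuristic_k(n_train: int) -> int:
    """Return an odd k close to the square root of the training set size."""
    if n_train < 1:
        raise ValueError("n_train must be at least 1")
    k = max(1, round(math.sqrt(n_train)))
    # Assumption: prefer an odd k so that a binary vote is less likely to tie
    return k if k % 2 == 1 else k + 1

n_rows = 253_680                  # number of survey responses in the dataset
n_train = n_rows * 80 // 100      # assumed 80% training share (illustrative only)
k_start = sqrt_heuristic_k(n_train)
print(n_train, k_start)           # 202944 451
```

A value obtained this way is best treated as the center of a search range rather than a final setting.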

3. Classification decision based on majority voting

Once the k-nearest neighbors are identified, the algorithm assigns the new data point the most frequent class label among its neighbors. In cases of ties, distance-weighted voting can be applied, where closer neighbors have higher influence on the classification decision (Uddin et al. 2022).
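The voting step can be sketched as follows. The function name `vote`, the inverse-distance weighting, and the example neighbors are assumptions made for illustration; they show one common way to break ties and are not taken from Uddin et al.

```python
from collections import Counter, defaultdict

def vote(neighbors):
    """Pick a class from a list of (distance, label) pairs for the k nearest points."""
    counts = Counter(label for _, label in neighbors)
    top_count = max(counts.values())
    tied = [label for label, count in counts.items() if count == top_count]
    if len(tied) == 1:
        return tied[0]  # clear majority

    # Tie: give each neighbor a weight of 1 / distance so closer points count more
    weights = defaultdict(float)
    for dist, label in neighbors:
        if label in tied:
            weights[label] += 1.0 / (dist + 1e-9)  # small constant avoids division by zero
    return max(tied, key=lambda label: weights[label])

# Clear majority: two hearts outvote the single (closer) circle
print(vote([(0.1, "circle"), (1.0, "heart"), (1.1, "heart")]))

# Two-against-two tie: the hearts are closer, so they win the weighted vote
print(vote([(0.5, "heart"), (1.0, "circle"), (1.2, "heart"), (2.0, "circle")]))
```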

Assumptions

The kNN algorithm calculates the Euclidean distance between the unknown datapoint and the training datapoints because it assumes that similar datapoints lie in close proximity to each other, as neighbors, and that data points with similar features belong to the same class. (boateng2020basic?)

Implementation of kNN

Code
#install.packages("DiagrammeR")
library(DiagrammeR)

grViz("
digraph {
  graph [layout = dot, rankdir = LR, splines = true, size= 10]
  node [shape = box, style = 'rounded,filled', fillcolor = lightblue, fontname = Arial, fontsize = 25, penwidth = 2]
  
  A [label = '1. Load Required Libraries',width=3, height=1.5]
  B [label = '2. Import & Explore Dataset',width=3, height=1.5]
  C [label = '3. Is preprocessing required?', shape = circle, fillcolor = lightblue, width=0.8, height=0.8, fontsize=25]
  D [label = '3a. Pre-Process the data',width=3, height=1.5]
  E [label = '4. Split Dataset into Training & Testing',width=3, height=1.5]
  F [label = '5. Hyperparameter tuning',width=3, height=1.5]
  G [label = '6. Train kNN Model',width=3, height=1.5]
  H [label = '7. Make Predictions',width=3, height=1.5]
  I [label = '8. Evaluate Model',width=3, height=1.5]
  
  # Edge style (declared before the edges so that it applies to them)
  edge [color = '#8B814C', arrowhead = vee, penwidth = 2]

  A -> B
  B -> C
  C -> E [label = 'No', fontsize=25]
  C -> D [label = 'Yes', fontsize=25]
  D -> E
  E -> F
  F -> G
  G -> H
  H -> I
}
")

Pre-processing Data

Data must be prepared before implementing the kNN. In order for the kNN algorithm to work better we can do the following:

  1. Handle missing values: kNNs work by calculating the distance between datapoints, and missing values can skew the results. We must eliminate the missing values by either imputing them or dropping the affected records.
  2. Make all values numeric: kNNs only handle numeric values, so all categorical values must be encoded using either one-hot encoding or label encoding.
  3. Normalize or standardize the features: Features measured on larger scales would otherwise dominate the distance calculation, so we must normalize or standardize them to reduce this bias. We can use the min-max scaler or the standard scaler to do this.
  4. Reduce dimensionality: The kNN can struggle to calculate the distance between datapoints if there are too many features. To solve this we can use Principal Component Analysis to reduce the number of features while keeping most of the variance.
  5. Fix class imbalance: Class imbalances can lead to a bias. We noticed a class imbalance in our dataset and chose to use the Synthetic Minority Over-sampling Technique (SMOTE) to handle the imbalance.
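To make the scaling step concrete, the sketch below applies min-max normalization and standardization to the six BMI values shown in the dataset preview. The function names are our own, and the plain-Python implementation is only meant to show the arithmetic behind the two scalers, not the code used in this project.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# BMI column of the first six rows in the dataset preview
bmi = [40, 25, 28, 27, 24, 25]

bmi_minmax = min_max_scale(bmi)
bmi_z = standardize(bmi)

print([round(v, 4) for v in bmi_minmax])  # [1.0, 0.0625, 0.25, 0.1875, 0.0, 0.0625]
print([round(v, 2) for v in bmi_z])
```

In a real pipeline the minimum, maximum, mean, and standard deviation should be computed on the training split only and then reused for the test split, so that no information leaks from the test data.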

Hyperparameter Tuning

In order to increase the accuracy of the model there are a few parameters that we can adjust.

  1. Find the optimal k parameter: We can use grid search to find the best value of k.
  2. Change the distance metric: The kNN uses the euclidean distance by default but we can use the Manhattan distance, the Minkowski distance or another distance.
  3. Weights: The kNN defaults to a “uniform” weight where it gives the same weight to all the distances but it can be adjusted to “distance” so that the closest neighbors have more weight.
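The three settings above can be searched together. The following is a minimal, dependency-free sketch of a grid search scored with leave-one-out cross-validation on a tiny toy dataset. All names (`knn_predict`, `loo_accuracy`, the toy points) are hypothetical, and the sketch illustrates the idea rather than the tuning code used in this project; in practice a library routine such as scikit-learn's `GridSearchCV` with `KNeighborsClassifier` would normally be used.

```python
import math
from collections import defaultdict
from itertools import product

def distance(a, b, metric):
    if metric == "euclidean":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    raise ValueError(f"unknown metric: {metric}")

def knn_predict(train_X, train_y, query, k, metric, weights):
    # Keep the k training points closest to the query
    nearest = sorted(
        (distance(x, query, metric), label) for x, label in zip(train_X, train_y)
    )[:k]
    scores = defaultdict(float)
    for dist, label in nearest:
        # "uniform": every neighbor counts once; "distance": closer neighbors count more
        scores[label] += 1.0 if weights == "uniform" else 1.0 / (dist + 1e-9)
    return max(scores, key=scores.get)

def loo_accuracy(X, y, k, metric, weights):
    # Leave-one-out: predict each point from all the others
    hits = 0
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k, metric, weights) == y[i]
    return hits / len(X)

# Toy data (made up): two well-separated classes
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.3], [1.1, 1.1],
     [3.0, 3.2], [3.1, 2.9], [2.8, 3.0], [3.2, 3.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

grid = {"k": [1, 3, 5], "metric": ["euclidean", "manhattan"], "weights": ["uniform", "distance"]}

results = []
for k, metric, weights in product(grid["k"], grid["metric"], grid["weights"]):
    results.append((loo_accuracy(X, y, k, metric, weights), k, metric, weights))

best = max(results, key=lambda r: r[0])
print("best (accuracy, k, metric, weights):", best)
```

On a dataset as large as the one used here, leave-one-out scoring would be far too slow; k-fold cross-validation on a sample is the more practical choice.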

Advantages and Limitations

One of the advantages of the kNN is that it is easy to understand and implement. It is able to maintain great accuracy even with noisy data (Syriopoulos et al. 2023). A serious limitation is its high computational cost and the large amount of memory it needs to calculate the distance between all the datapoints. The kNN also has low accuracy with multidimensional data that has irrelevant features (Saxena, Khan, and Singh 2014). Having to calculate the distance to every datapoint makes the kNN slow when the number of datapoints gets very large, as is the case with big data, where calculating the distances between all the datapoints in a large file takes a significant amount of time (Deng et al. 2016).

Analysis and Results

1. Data Exploration

We explored the CDC Diabetes Health Indicators dataset, sourced from the UC Irvine Machine Learning Repository. It is a set of data that was gathered by the Centers for Disease Control and Prevention (CDC) through the Behavioral Risk Factor Surveillance System (BRFSS), which is one of the biggest continuous health surveys in the United States.

The BRFSS is an annual telephone survey that has been ongoing since 1984 and each year, more than 400,000 Americans respond to the survey. It provides important data on health behaviors, chronic diseases, and preventive health care use to help researchers and policymakers understand the health status and risks of the public.

To obtain the data, we used Python and the ucimlrepo package to import the dataset directly from the UCI Machine Learning Repository, following the recommended instructions. This enabled us to easily save, prepare, and analyze the data for this study.

Data Composition

The dataset consists of 253,680 survey responses and contains 22 variables, including:

• 1 target variable (Diabetes_binary): A binary classification indicating diabetes status (0 = No, 1 = Yes, including prediabetes).

• 21 feature variables representing demographic, behavioral, and health-related attributes.

The dataset includes a mix of categorical, ordinal, and continuous variables, covering factors such as:

• Demographics: Age, Sex, Income, Education

• Health conditions: High Blood Pressure, High Cholesterol, Stroke, Heart Disease

• Lifestyle factors: Smoking, Physical Activity, Diet

• Self-reported health status: General Health, Mental Health, Physical Health

This dataset provides a large-scale representation of diabetes-related risk factors, making it valuable for exploratory data analysis, statistical modeling, and machine learning applications aimed at improving diabetes risk assessment and prevention strategies.

Code
from ucimlrepo import fetch_ucirepo 

# Loading the dataset  
# fetch dataset 
cdc_diabetes_health_indicators = fetch_ucirepo(id=891) 
  
# data (as pandas dataframes) 
X = cdc_diabetes_health_indicators.data.features 
y = cdc_diabetes_health_indicators.data.targets 
  

Feature Overview & Data Encoding

The dataset consists of four types of variables:

1. Target Variable (1) Diabetes_binary:

Binary classification (0 = No diabetes, 1 = Diabetes/prediabetes).

2. Binary Variables (14) Encoded as 0 = No, 1 = Yes (except Sex: 0 = Female, 1 = Male):

Health Conditions: HighBP, HighChol, CholCheck, Stroke, HeartDiseaseorAttack

Lifestyle Factors: Smoker, PhysActivity, Fruits, Veggies, HvyAlcoholConsump

Healthcare Access & Mobility: AnyHealthcare, NoDocbcCost, DiffWalk, Sex

3. Ordinal Variables (6) Encoded as numerical ranks to preserve meaningful order:

Self-Reported Health Status: GenHlth, MentHlth, PhysHlth

Demographics: Age, Education, Income

Higher values indicate progression (e.g., older age groups, higher education levels, or higher income brackets).

4. Continuous Variable (1) BMI: Numeric variable representing Body Mass Index.

The table below provides a detailed breakdown of all variables by type, description, and range of values.

Code
# Load necessary packages
library(knitr)

# Create a Data Frame with Variable Information
table_data <- data.frame(
  Type = c(
    "Target",
    "Binary", "", "", "", "", "", "", "", "", "", "", "", "", "",
    "Ordinal", "", "", "", "", "",
    "Continuous"
  ),
  Variable = c(
    "Diabetes_binary",
    "HighBP", "HighChol", "CholCheck", "Smoker", "Stroke", "HeartDiseaseorAttack", 
    "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump", "AnyHealthcare", 
    "NoDocbcCost", "DiffWalk", "Sex",
    "GenHlth", "MentHlth", "PhysHlth", "Age", "Education", "Income",
    "BMI"
  ),
  Description = c(
    "Indicates whether a person has diabetes",
    "High Blood Pressure", "High Cholesterol", "Cholesterol check in the last 5 years",
    "Smoked at least 100 cigarettes in lifetime", "Had a stroke", "History of heart disease or attack",
    "Engaged in physical activity in the last 30 days", "Regular fruit consumption", 
    "Regular vegetable consumption", "Heavy alcohol consumption", "Has health insurance or healthcare access",
    "Could not see a doctor due to cost", "Difficulty walking/climbing stairs", "Biological sex",
    "Self-reported general health (1=Excellent, 5=Poor)", 
    "Number of mentally unhealthy days in last 30 days", "Number of physically unhealthy days in last 30 days",
    "Age Groups (1 = 18-24, ..., 13 = 80+)", 
    "Highest education level (1 = No school, ..., 6 = College graduate)", 
    "Household income category (1 = <$10K, ..., 8 = $75K+)", 
    "Body Mass Index (BMI), measure of body fat"
  ),
  Range = c(
    "(0 = No, 1 = Yes)",
    "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)",
    "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", 
    "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", 
    "(0 = No, 1 = Yes)", "(0 = Female, 1 = Male)",
    "(1 = Excellent, ..., 5 = Poor)", "(0 - 30)", "(0 - 30)", 
    "(1 = 18-24, ..., 13 = 80+)", "(1 = No school, ..., 6 = College grad)", 
    "(1 = <$10K, ..., 8 = $75K+)", "(12 - 98)"
  )
)

# Print Table with knitr::kable()
kable(table_data, caption = "Table 1. Summary of Explanatory Variables", align = "l")
Table 1. Summary of Explanatory Variables

| Type | Variable | Description | Range |
|---|---|---|---|
| Target | Diabetes_binary | Indicates whether a person has diabetes | (0 = No, 1 = Yes) |
| Binary | HighBP | High Blood Pressure | (0 = No, 1 = Yes) |
| | HighChol | High Cholesterol | (0 = No, 1 = Yes) |
| | CholCheck | Cholesterol check in the last 5 years | (0 = No, 1 = Yes) |
| | Smoker | Smoked at least 100 cigarettes in lifetime | (0 = No, 1 = Yes) |
| | Stroke | Had a stroke | (0 = No, 1 = Yes) |
| | HeartDiseaseorAttack | History of heart disease or attack | (0 = No, 1 = Yes) |
| | PhysActivity | Engaged in physical activity in the last 30 days | (0 = No, 1 = Yes) |
| | Fruits | Regular fruit consumption | (0 = No, 1 = Yes) |
| | Veggies | Regular vegetable consumption | (0 = No, 1 = Yes) |
| | HvyAlcoholConsump | Heavy alcohol consumption | (0 = No, 1 = Yes) |
| | AnyHealthcare | Has health insurance or healthcare access | (0 = No, 1 = Yes) |
| | NoDocbcCost | Could not see a doctor due to cost | (0 = No, 1 = Yes) |
| | DiffWalk | Difficulty walking/climbing stairs | (0 = No, 1 = Yes) |
| | Sex | Biological sex | (0 = Female, 1 = Male) |
| Ordinal | GenHlth | Self-reported general health (1=Excellent, 5=Poor) | (1 = Excellent, …, 5 = Poor) |
| | MentHlth | Number of mentally unhealthy days in last 30 days | (0 - 30) |
| | PhysHlth | Number of physically unhealthy days in last 30 days | (0 - 30) |
| | Age | Age Groups (1 = 18-24, …, 13 = 80+) | (1 = 18-24, …, 13 = 80+) |
| | Education | Highest education level (1 = No school, …, 6 = College graduate) | (1 = No school, …, 6 = College grad) |
| | Income | Household income category (1 = <$10K, …, 8 = $75K+) | (1 = <$10K, …, 8 = $75K+) |
| Continuous | BMI | Body Mass Index (BMI), measure of body fat | (12 - 98) |

The following table displays the first few rows of the CDC Diabetes Health Indicators dataset.

Code
library(knitr)
library(readr)

cdc_data_df <- read_csv("cdc_data.csv")

kable(head(cdc_data_df))
HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income Diabetes_binary
1 1 1 40 1 0 0 0 0 1 0 1 0 5 18 15 1 0 9 4 3 0
0 0 0 25 1 0 0 1 0 0 0 0 1 3 0 0 0 0 7 6 1 0
1 1 1 28 0 0 0 0 1 0 0 1 1 5 30 30 1 0 9 4 8 0
1 0 1 27 0 0 0 1 1 1 0 1 0 2 0 0 0 0 11 3 6 0
1 1 1 24 0 0 0 1 1 1 0 1 0 2 3 0 0 0 11 5 4 0
1 1 1 25 1 0 0 1 1 1 0 1 0 2 0 2 0 1 10 6 8 0

1.2 Exploratory Data Analysis (EDA)

Code

import pandas as pd
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
cdc_diabetes_health_indicators = fetch_ucirepo(id=891) 
  
# data (as pandas dataframes) 
X = cdc_diabetes_health_indicators.data.features 
y = cdc_diabetes_health_indicators.data.targets 

cdc_data_df = pd.concat([cdc_diabetes_health_indicators.data.features, 
                         cdc_diabetes_health_indicators.data.targets], axis=1)
                         
# Summarize the integrity checks: NaN cells, blank-string cells, duplicate rows, total rows
exploratory_data_analysis = {
    "Exploratory Data Analysis": ["Number of Nulls", "Missing Data", "Duplicate Rows", "Total Rows"],
    "Count": [
        cdc_data_df.isna().sum().sum(),
        (cdc_data_df == " ").sum().sum(),
        cdc_data_df.duplicated().sum(),
        cdc_data_df.shape[0],
    ],
}

exploratory_data_analysis_df=pd.DataFrame(exploratory_data_analysis)

exploratory_data_analysis_df.to_csv("eda.csv", index=False)

Data Integrity Assessment

In this step, we checked for null values (NaNs) and duplicate rows to ensure data integrity. Additionally, we checked for invalid values such as blank strings in numeric fields, which are reported as “Missing Data” in Table 2.

Code
library(knitr)
library(readr)

# Load the dataset
exploratory_df <- read_csv("eda.csv")

# Print table with a new title (caption)
kable(exploratory_df, caption = "Table 2: Data Integrity Report")
Table 2: Data Integrity Report
Exploratory Data Analysis Count
Number of Nulls 0
Missing Data 0
Duplicate Rows 24206
Total Rows 253680

Key Findings:

There are no missing values, meaning no imputation is needed.

24,206 duplicate records were detected, which need to be analyzed to determine whether they should be removed or weighted to prevent redundancy in model training.

Statistical Summary

A summary of the dataset’s key statistical properties provides insights into central tendencies, variability, and distribution patterns. This analysis helps identify potential imbalances, outliers, and preprocessing needs, such as scaling or encoding, to ensure optimal model performance.

Code
df_stats= cdc_data_df.describe()
df_stats
              HighBP       HighChol  ...         Income  Diabetes_binary
count  253680.000000  253680.000000  ...  253680.000000    253680.000000
mean        0.429001       0.424121  ...       6.053875         0.139333
std         0.494934       0.494210  ...       2.071148         0.346294
min         0.000000       0.000000  ...       1.000000         0.000000
25%         0.000000       0.000000  ...       5.000000         0.000000
50%         0.000000       0.000000  ...       7.000000         0.000000
75%         1.000000       1.000000  ...       8.000000         0.000000
max         1.000000       1.000000  ...       8.000000         1.000000

[8 rows x 22 columns]

Key Findings from Statistical Summary:

Class Imbalance:

Only 13.9% of respondents have diabetes or prediabetes, which indicates an imbalance in the target variable. This may require oversampling (SMOTE) or class weighting when training models.
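The size of the imbalance, and the class weights it would imply, can be estimated directly from the summary statistics. The sketch below uses the reported row count and the mean of Diabetes_binary; because that mean is rounded, the class counts are approximate. The "balanced" weighting formula, n_samples / (n_classes × class_count), is a common convention that we assume here for illustration.

```python
n_rows = 253_680          # total rows, from the statistical summary
positive_rate = 0.139333  # mean of Diabetes_binary, from the statistical summary

# Approximate class counts recovered from the rounded mean
n_pos = round(n_rows * positive_rate)
n_neg = n_rows - n_pos

# How many majority-class rows there are for each minority-class row
imbalance_ratio = n_neg / n_pos

# "Balanced" class weights: rarer classes receive proportionally larger weights
class_weights = {
    0: n_rows / (2 * n_neg),
    1: n_rows / (2 * n_pos),
}

print(f"positives ~ {n_pos}, negatives ~ {n_neg}")
print(f"imbalance ratio ~ {imbalance_ratio:.2f} to 1")
print({label: round(w, 3) for label, w in class_weights.items()})
```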

BMI and High Blood Pressure are Major Health Concerns:

  • The average BMI is 28.38, which falls within the overweight range (25 to 29.9).
  • 43% of the population has high blood pressure, which is a known risk factor for diabetes.

Physical Activity and Diet Indicators:

  • 75% of individuals engage in regular physical activity.
  • 81% eat vegetables regularly, and 63% eat fruits regularly, suggesting generally healthy dietary habits.

Age and Income Influence Health Outcomes:

  • Older individuals are more likely to develop diabetes.
  • Higher income groups tend to report better health, which may correlate with healthcare access.

2. Visual Data Analysis

The goal of visualization in exploratory data analysis (EDA) is to understand feature distributions, detect potential issues such as class imbalance and outliers, and identify relationships between variables. This helps in making informed decisions about data preprocessing, feature selection, and model improvements before training machine learning models.

2.1 Class Imbalance in Diabetes Prevalence

An analysis of the target variable (Diabetes_binary) reveals significant class imbalance, which may skew model predictions toward the majority class.

Observations:

The dataset exhibits a significant class imbalance, with the majority class (No Diabetes = 0) greatly outnumbering the minority class (Diabetes/Prediabetes = 1).

This imbalance can lead to biased model predictions, favoring the dominant class while under-detecting diabetes cases.

To address this, techniques such as oversampling (SMOTE) or undersampling should be considered to improve classification performance.

Code
import matplotlib.pyplot as plt
import seaborn as sns

# Define the target variable
target_variable = "Diabetes_binary"

# Check if the target variable exists in the dataframe
if target_variable in cdc_data_df.columns:
    # Calculate counts and percentages
    class_counts = cdc_data_df[target_variable].value_counts()
    class_percentages = cdc_data_df[target_variable].value_counts(normalize=True) * 100

    # Create a bar plot
    plt.figure(figsize=(6, 4))
    ax = sns.barplot(x=class_counts.index, y=class_counts.values, palette="Set2")

    # Annotate bars with counts and percentages
    for i, value in enumerate(class_counts.values):
        percentage = class_percentages[i]
        ax.text(i, value + 1000, f"{value} ({percentage:.2f}%)", ha="center", fontsize=12)

    # Titles and labels
    plt.title(f"Class Distribution of {target_variable}")
    plt.ylabel("Count")
    plt.xlabel("Diabetes Status (0 = No, 1 = Diabetes/Prediabetes)")
    plt.xticks([0, 1], ["No Diabetes", "Diabetes/Prediabetes"])

    # Show the plot
    plt.show()

    # Print the percentage breakdown
    print("Class Distribution in Percentage:")
    print(class_percentages.round(2))
else:
    print("No target variable detected. Please confirm which column represents diabetes.")
[Figure: bar chart "Class Distribution of Diabetes_binary" (x-axis: Diabetes Status; y-axis: Count), with bars annotated 218334 (86.07%) for No Diabetes and 35346 (13.93%) for Diabetes/Prediabetes]
Class Distribution in Percentage:
Diabetes_binary
0    86.07
1    13.93
Name: proportion, dtype: float64

2.2 Correlation Analysis

A correlation heatmap was generated to examine relationships between variables. The correlation heatmap helps identify strongly correlated features, which may lead to redundancy in the model.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Compute the correlation matrix (kept in the same chunk as the plot that uses it)
corr_matrix = cdc_data_df.corr()

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5, vmin=-1, vmax=1)
plt.title("Feature Correlation Heatmap")
plt.show()

Positive Correlations:

General Health (GenHlth) is moderately correlated with Physical Health (PhysHlth) (0.52) and Difficulty Walking (DiffWalk) (0.45).

As individuals report poorer general health, they experience more physical health issues and mobility limitations.

Physical Health (PhysHlth) and Difficulty Walking (DiffWalk) (0.47) are also moderately linked. Those with more days of poor physical health are likely to struggle with mobility.

Age correlates with High Blood Pressure (0.34) and High Cholesterol (0.27), indicating an increased risk of cardiovascular conditions as people get older.

Mental Health (MentHlth) and Physical Health (PhysHlth) (0.34) are positively associated. Worsening mental health often coincides with physical health problems.

Negative Correlations:

• Higher Income is associated with better General Health (-0.33), fewer Mobility Issues (-0.30), and better Physical Health (-0.24).

This suggests financial stability improves access to healthcare and promotes a healthier lifestyle.

• Higher Education is linked to better General Health (-0.28) and Mental Health (-0.19). Educated individuals may have better health awareness and coping strategies.

The heatmap is consistent with well-known health trends: age, high blood pressure, and high cholesterol are positively correlated with one another, poor physical and mental health tend to occur together, and socioeconomic status (income, education) is associated with better overall health. These insights highlight the importance of early intervention strategies and lifestyle modifications to prevent chronic diseases like diabetes.

Since the strongest correlation reported above is only 0.52 (GenHlth and PhysHlth), multicollinearity is not a major issue, and we don’t need to remove any variables.
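
Reading the largest coefficients off a heatmap is error-prone, so the check can also be done programmatically. The sketch below lists every feature pair whose absolute Pearson correlation exceeds a threshold. The helper `correlated_pairs`, the 0.5 threshold, and the toy columns are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.5):
    """Return (feature_a, feature_b, |r|) for pairs with |r| above threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    rows = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            r = float(corr.loc[a, b])
            if r > threshold:
                rows.append((a, b, round(r, 2)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical toy data: "x_noisy" is built from "x", "unrelated" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "x": base,
    "x_noisy": base + 0.1 * rng.normal(size=500),
    "unrelated": rng.normal(size=500),
})

pairs = correlated_pairs(toy, threshold=0.5)
print(pairs)
```

Calling `correlated_pairs(cdc_data_df, threshold=0.5)` on the real data would, given the coefficients reported above, be expected to include the GenHlth and PhysHlth pair.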

2.3 Age and Diabetes Prevalence

This boxplot illustrates the relationship between age and diabetes status (0 = No Diabetes, 1 = Diabetes/Prediabetes).

Code
# Define colors explicitly, ensuring the keys are strings
palette_colors = {"0": "mediumaquamarine", "1": "gold"}

# Convert the 'Diabetes_binary' column to string type to match the palette keys
cdc_data_df["Diabetes_binary"] = cdc_data_df["Diabetes_binary"].astype(str)

# Create the boxplot with corrected hue mapping
plt.figure(figsize=(6, 4))
sns.boxplot(x="Diabetes_binary", y="Age", data=cdc_data_df, palette=palette_colors)

# Add title and labels
plt.title("Age Distribution by Diabetes Status")
plt.xlabel("Diabetes Status (0 = No, 1 = Diabetes/Prediabetes)")
plt.ylabel("Age")

# Show the plot
plt.show()

Individuals with diabetes/prediabetes (1) tend to be older than those without.

The median age is noticeably higher in the diabetes group.

The interquartile range (IQR) suggests that most diabetic individuals fall within a more concentrated age range.

Outliers in both groups indicate that some younger individuals also develop diabetes, suggesting the influence of additional risk factors.

This visualization supports the well-established link between aging and diabetes risk, reinforcing the importance of early monitoring in older populations.

2.4 General Health and Diabetes Status

This bar chart visualizes the distribution of self-reported general health (GenHlth) among individuals with and without diabetes/prediabetes.

Code
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure Diabetes_binary column is integer type for hue mapping
if 'Diabetes_binary' in cdc_data_df.columns:
    cdc_data_df['Diabetes_binary'] = cdc_data_df['Diabetes_binary'].astype(int)

# Ensure GenHlth column is categorical with the correct order
if 'GenHlth' in cdc_data_df.columns:
    cdc_data_df['GenHlth'] = cdc_data_df['GenHlth'].astype(int)

# Define color mapping ensuring keys match hue values (0 and 1)
palette_colors = {0: "mediumaquamarine", 1: "gold"}

# Create the countplot with hue_order to ensure correct mapping
plt.figure(figsize=(10, 5))
sns.countplot(x='GenHlth', hue='Diabetes_binary', data=cdc_data_df, 
              palette=palette_colors, hue_order=[0, 1])

# Add labels and title
plt.title("General Health vs Diabetes Status", fontsize=16)
plt.xlabel("General Health (Self-Reported)", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.xticks(rotation=0)  # Ensure labels are horizontal for clarity
# Ensure legend is displayed correctly
plt.legend(title="Diabetes Status (0 = No, 1 = Diabetes/Prediabetes)")

# Show the plot
plt.show()

General trend:

Most individuals rated their health as very good or good (values 2 and 3), while comparatively few rated it as fair or poor (values 4 and 5).

Diabetes prevalence:

As general health worsens (higher values), the proportion of individuals with diabetes (gold bars) increases. This suggests a possible association between self-reported poor health and diabetes prevalence.

Majority without diabetes:

The majority of the dataset consists of individuals without diabetes (green bars), which aligns with the dataset imbalance previously observed.
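
A count plot shows raw counts, so the rising share of diabetes cases is easier to confirm numerically. The sketch below computes the proportion of class 1 within each general health level. The toy values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical toy data: 4 respondents at each of three GenHlth levels.
toy = pd.DataFrame({
    "GenHlth":         [1, 1, 1, 1, 3, 3, 3, 3, 5, 5, 5, 5],
    "Diabetes_binary": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 column within a group is the share of 1s in that group.
rate_by_health = toy.groupby("GenHlth")["Diabetes_binary"].mean()
print(rate_by_health)
```

The same `groupby(...).mean()` call on `cdc_data_df` would give the diabetes/prediabetes share at each GenHlth level, assuming `Diabetes_binary` is stored as an integer at that point.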

2.5 BMI Distribution and Density Analysis by Diabetes Status

BMI is a known risk factor for diabetes, and the analysis confirms that individuals with diabetes tend to have slightly higher BMI values on average. The KDE (Kernel Density Estimate) plot visualizes the distribution of BMI values for individuals with and without diabetes (or prediabetes).

Code
import matplotlib.pyplot as plt
import seaborn as sns

# Define the target variable
target_variable = "Diabetes_binary"

# Ensure that target_variable is in the dataframe
if target_variable in cdc_data_df.columns:
    plt.figure(figsize=(8, 5))
    sns.boxplot(x=target_variable, y="BMI", data=cdc_data_df, 
                hue=target_variable, palette="Set3", legend=False)
    
    plt.title("BMI Distribution by Diabetes Status")
    plt.xlabel("Diabetes Status (0 = No, 1 = Diabetes/Prediabetes)")
    plt.ylabel("BMI")
    plt.show()
else:
    print("Target variable not found in dataset.")

Code
    
# Set figure size
plt.figure(figsize=(10, 6))

# KDE plot for BMI distribution by diabetes status
sns.kdeplot(data=cdc_data_df[cdc_data_df['Diabetes_binary'] == 0]['BMI'], 
            label='No Diabetes (0)', color="mediumaquamarine", fill=True)

sns.kdeplot(data=cdc_data_df[cdc_data_df['Diabetes_binary'] == 1]['BMI'], 
            label='Diabetes/Prediabetes (1)', color="salmon", fill=True)

# Titles and labels
plt.title('BMI Density by Diabetes Status', fontsize=16)
plt.xlabel('BMI', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.legend(title='Diabetes Status')

# Show plot
plt.show()

The analysis of BMI distribution across individuals with and without diabetes highlights some key trends:

1. General BMI Trends:

Diabetic individuals tend to have a slightly higher median BMI compared to non-diabetic individuals.

A significant portion of diabetic individuals have a BMI above 30, aligning with known research that obesity is a major risk factor for diabetes.

2. Presence of Outliers:

Both groups contain extreme BMI values, particularly in the severely obese range.

These extreme values may disproportionately affect model performance and should be further investigated.

If necessary, outlier removal or transformation techniques (e.g., log transformation, winsorization) could be applied to limit their influence on the model.

3. BMI and Diabetes Relationship:

The KDE density plot reveals that while individuals with diabetes generally have higher BMI values, the overall distribution still shows overlap between the two groups.

The density shift towards higher BMI values (above 30) for diabetic individuals suggests an association between obesity and diabetes risk.

4. Overlap Between Groups:

Despite the observed trends, BMI alone does not serve as a strong distinguishing factor for diabetes, as there is a significant overlap in distributions.

Other factors, such as age, cholesterol levels, and physical activity, should also be considered to improve the predictive accuracy of diabetes risk assessment.

Conclusion

This Exploratory Data Analysis (EDA) provides a comprehensive overview of the dataset’s structure, distributions, and key correlations. The findings highlight several critical patterns:

Diabetes prevalence is low (13.9%), leading to a class imbalance that may require resampling techniques.

Age, BMI, and high blood pressure are strong risk factors for diabetes.

Socioeconomic factors (income, education) influence health status, supporting the need for targeted interventions.

The next phase involves data preprocessing, feature selection, and model development to enhance predictive performance.

3. Modeling and Results

Data Preprocessing

The ordinal categorical variables include Age, Education, Income, and GenHlth. We kept them as integer codes rather than one-hot encoding them because Age, Education, and Income have a natural order with meaningful distances. For example, a larger value for Age or Income indicates an older age group or a higher income bracket. The dataset also includes BMI, MentHlth, and PhysHlth as continuous variables, which we standardized during the preprocessing step.

Code
# Import missing libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Clean Data

Code
# Drop duplicate rows
cdc_data_df.drop_duplicates(inplace=True)

# Define features and target
X = cdc_data_df.drop(columns=['Diabetes_binary'])  # Exclude target variable
y = cdc_data_df["Diabetes_binary"]  # Target variable

# Print class distribution before SMOTE
print("Class Distribution Before SMOTE:\n", y.value_counts(normalize=True) * 100)
Class Distribution Before SMOTE:
 Diabetes_binary
0    84.705457
1    15.294543
Name: proportion, dtype: float64

Before removing duplicates, the dataset contained 253,680 survey responses, with an 86.07% majority class (No Diabetes) and a 13.93% minority class (Diabetes/Prediabetes). After removing duplicate rows, the dataset is smaller and the class distribution shifts slightly to 84.71% (No Diabetes) and 15.29% (Diabetes/Prediabetes). This change occurred because duplicate entries were not evenly distributed across the classes: proportionally more duplicates existed in the majority class, so removing them slightly increased the share of the minority class. This step ensured a cleaner dataset while preserving meaningful class representation for further analysis.

Apply SMOTE for Class Balancing

Code
# Apply SMOTE to balance classes
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Convert back to DataFrame for consistency
new_df = pd.DataFrame(X_resampled, columns=X.columns)
new_df['Diabetes_binary'] = y_resampled  # Add target variable back

# Check class distribution after SMOTE
print("Class Distribution After SMOTE:\n", y_resampled.value_counts(normalize=True) * 100)
Class Distribution After SMOTE:
 Diabetes_binary
0    50.0
1    50.0
Name: proportion, dtype: float64
Code
# Plot class distribution after SMOTE
plt.figure(figsize=(6, 4))
ax = sns.barplot(x=y_resampled.value_counts().index, y=y_resampled.value_counts().values, palette="Set2")

for i, value in enumerate(y_resampled.value_counts().values):
    percentage = y_resampled.value_counts(normalize=True)[i] * 100
    ax.text(i, value + 1000, f"{value} ({percentage:.2f}%)", ha="center", fontsize=12)

plt.title("Class Distribution After SMOTE")
plt.xlabel("Diabetes Status (0 = No, 1 = Diabetes/Prediabetes)")
plt.ylabel("Count")
plt.show()

After applying SMOTE, the class distribution of the target variable (Diabetes_binary) is now balanced, with equal representation of both classes. This prevents the model from being biased towards the majority class and ensures better learning from the minority class. The balanced dataset will likely improve model performance, particularly in recall and overall classification metrics.
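
To make the balancing step less of a black box, the sketch below illustrates the core idea behind SMOTE: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration written for this explanation, not the imbalanced-learn implementation; the function `smote_like_sample` and its parameters are hypothetical.

```python
import numpy as np


def smote_like_sample(X_minority, k=2, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))            # pick a minority sample
        dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]      # skip position 0 (the point itself)
        j = rng.choice(neighbours)                   # pick one of its k nearest neighbours
        lam = rng.random()                           # interpolation weight in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)


# Hypothetical minority samples with two binary features.
X_min = [[0, 0], [1, 0], [0, 1], [1, 1]]
synthetic = smote_like_sample(X_min, k=2, n_new=5, seed=0)
print(synthetic.round(2))
```

Because the new points are interpolated, binary and ordinal columns can take fractional values after resampling. For data dominated by such columns, a categorical-aware variant such as SMOTENC in imbalanced-learn may be worth considering.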

Train-Test Split

Code
# Split resampled dataset into training & test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.3, random_state=100, stratify=y_resampled
)
print("Train/Test Split Completed!")
Train/Test Split Completed!
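
In the workflow above, SMOTE is applied before the split, so the test set also contains synthetic rows. A common alternative is to split first and resample only the training portion, which leaves the test set at its original class ratio. The sketch below shows that ordering on hypothetical toy data. Random oversampling with scikit-learn's `resample` stands in for SMOTE so that the example needs no extra packages.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 30 positives and 170 negatives (15% minority class).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 30 + [0] * 170)

# 1) Split first, so the test set keeps the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)

# 2) Oversample the minority class in the training set only.
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.array([0] * len(majority) + [1] * len(minority_up))

print("Balanced training counts:", np.bincount(y_tr_bal))
print("Untouched test counts:", np.bincount(y_te))
```

With this ordering, metrics computed on `X_te` describe performance on the class distribution that would be seen in practice.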

Scale Features (Essential for KNN)

Code
# Standardize all feature columns (the scaler is fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature Scaling Completed!")
Feature Scaling Completed!
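
The scaling code above applies `StandardScaler` to every column. If the intent is to standardize only the continuous variables (BMI, MentHlth, PhysHlth), one option is scikit-learn's `ColumnTransformer`, sketched below on hypothetical toy rows. Whether to leave binary and ordinal columns unscaled is a judgment call for kNN, since it changes how much each column contributes to the distance.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

continuous_cols = ["BMI", "MentHlth", "PhysHlth"]

# Hypothetical toy training rows with one binary column.
toy_train = pd.DataFrame({
    "BMI": [22.0, 28.0, 34.0, 40.0],
    "MentHlth": [0.0, 5.0, 10.0, 30.0],
    "PhysHlth": [0.0, 2.0, 15.0, 30.0],
    "HighBP": [0, 1, 1, 0],
})

# Scale the continuous columns and pass every other column through unchanged.
scale_continuous = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",
)
train_scaled = scale_continuous.fit_transform(toy_train)

# Output column order: the scaled continuous columns first, then the passthrough columns.
print(np.round(train_scaled, 2))
```

As with the original code, the transformer should be fit on the training set and then applied to the test set with `transform`.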

Train and Evaluate KNN Model

a. Run KNN on Imbalanced Data

Code
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Split original dataset before SMOTE
X_train_imbal, X_test_imbal, y_train_imbal, y_test_imbal = train_test_split(
    X, y, test_size=0.3, random_state=100, stratify=y
)

# Standardize all feature columns (the scaler is fit on the training set only)
scaler = StandardScaler()
X_train_imbal_scaled = scaler.fit_transform(X_train_imbal)
X_test_imbal_scaled = scaler.transform(X_test_imbal)

# Train KNN on imbalanced dataset
knn_imbal = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_imbal.fit(X_train_imbal_scaled, y_train_imbal)
KNeighborsClassifier()
Code
# Predictions
y_pred_imbal = knn_imbal.predict(X_test_imbal_scaled)
y_proba_imbal = knn_imbal.predict_proba(X_test_imbal_scaled)[:, 1]

# Evaluate
accuracy_imbal = accuracy_score(y_test_imbal, y_pred_imbal)
roc_auc_imbal = roc_auc_score(y_test_imbal, y_proba_imbal)

print(f"KNN (Before SMOTE) - Accuracy: {accuracy_imbal:.4f}")
KNN (Before SMOTE) - Accuracy: 0.8325
Code
print(f"KNN (Before SMOTE) - ROC-AUC Score: {roc_auc_imbal:.4f}")
KNN (Before SMOTE) - ROC-AUC Score: 0.7046
Code
print("\nClassification Report (Before SMOTE):")

Classification Report (Before SMOTE):
Code
print(classification_report(y_test_imbal, y_pred_imbal))
              precision    recall  f1-score   support

           0       0.87      0.95      0.91     58314
           1       0.41      0.21      0.27     10529

    accuracy                           0.83     68843
   macro avg       0.64      0.58      0.59     68843
weighted avg       0.80      0.83      0.81     68843

The model performed well on the majority class (0) but poorly on diabetes cases (1): recall for class 1 was only 0.21, so roughly four out of five diabetes/prediabetes cases in the test set were missed.
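
The gap between high accuracy and low recall follows directly from how the metrics are defined. The sketch below works through the arithmetic on a hypothetical set of ten labels (eight negatives, two positives); the numbers are illustrative and are not taken from the model above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
recall_1 = tp / (tp + fn)                   # share of actual positives that were found
precision_1 = tp / (tp + fp)                # share of predicted positives that are correct

print(f"accuracy={accuracy:.2f}, recall_1={recall_1:.2f}, precision_1={precision_1:.2f}")
```

Here accuracy is 0.80 even though only half of the positives are detected, which is the same pattern seen in the classification report for the imbalanced data.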

b. Run KNN on Balanced Data (After SMOTE)

Code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_curve, roc_auc_score

# Train KNN model
knn_model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_model.fit(X_train_scaled, y_train)
KNeighborsClassifier()
Code
# Make predictions
y_pred = knn_model.predict(X_test_scaled)
y_test_proba_knn = knn_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_test_proba_knn)

print(f"KNN (After SMOTE) - Accuracy: {accuracy:.4f}")
KNN (After SMOTE) - Accuracy: 0.7619
Code
print(f"KNN (After SMOTE) - ROC-AUC Score: {roc_auc:.4f}")
KNN (After SMOTE) - ROC-AUC Score: 0.8337
Code
print("\nClassification Report:")

Classification Report:
Code
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.81      0.68      0.74     58314
           1       0.73      0.84      0.78     58313

    accuracy                           0.76    116627
   macro avg       0.77      0.76      0.76    116627
weighted avg       0.77      0.76      0.76    116627
Code

# Store ROC values for comparison later
fpr_knn, tpr_knn, _ = roc_curve(y_test, y_test_proba_knn)

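After SMOTE, the model is much better at detecting diabetes: recall for class 1 rose from 0.21 to 0.84. Accuracy is lower (0.7619 versus 0.8325), but performance is more balanced across the two classes.

ROC-AUC also improved, from 0.7046 to 0.8337, meaning the model ranks positive cases above negative cases more reliably.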
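
The ranking interpretation of ROC-AUC can be checked directly: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes that pairwise quantity on hypothetical scores and compares it with `roc_auc_score`. The scores are multiples of 0.2, as a k=5 classifier's `predict_proba` would produce.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.0, 0.2, 0.2, 0.4, 0.6, 1.0]

auc_pairs = pairwise_auc(y_true, scores)
auc_sklearn = roc_auc_score(y_true, scores)
print(auc_pairs, auc_sklearn)
```

Of the nine positive/negative pairs, seven are ranked correctly and one is tied, giving 7.5 / 9.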

Improve KNN Further

Feature Selection Using Chi-Square

We will test whether removing low-scoring features improves KNN.

Code
from sklearn.feature_selection import SelectKBest, chi2
import matplotlib.pyplot as plt

# Define features and target AFTER SMOTE
X_smote = new_df.drop(columns=['Diabetes_binary'])  # Features
y_smote = new_df["Diabetes_binary"]  # Target variable

# Apply Chi-Square feature selection
chi2_selector = SelectKBest(chi2, k="all")  
X_kbest = chi2_selector.fit_transform(X_smote, y_smote)

# Store feature scores
chi2_scores = chi2_selector.scores_

# Create DataFrame with results
chi2_results = pd.DataFrame({
    "Feature": X_smote.columns,
    "Chi2 Score": chi2_scores
})

# Sort by importance
chi2_results = chi2_results.sort_values(by="Chi2 Score", ascending=False)

# Print results
print(chi2_results)
                 Feature     Chi2 Score
15              PhysHlth  239844.378998
3                    BMI   43745.977692
14              MentHlth   28599.782095
18                   Age   20750.684140
20                Income   16962.323008
13               GenHlth   15098.962066
0                 HighBP   14099.691586
10     HvyAlcoholConsump   10110.409046
7           PhysActivity    7261.176761
16              DiffWalk    7113.765544
1               HighChol    6411.105864
19             Education    3828.302344
12           NoDocbcCost    3539.736152
8                 Fruits    3426.986423
9                Veggies    2010.374914
6   HeartDiseaseorAttack    1774.010061
17                   Sex    1170.405140
4                 Smoker     700.539845
2              CholCheck     138.221991
5                 Stroke      11.916428
11         AnyHealthcare       0.147905
Code
# Select the top 12 features by chi-square score
selected_features = chi2_results.head(12)["Feature"].values

print("Selected Features:", selected_features)
Selected Features: ['PhysHlth' 'BMI' 'MentHlth' 'Age' 'Income' 'GenHlth' 'HighBP'
 'HvyAlcoholConsump' 'PhysActivity' 'DiffWalk' 'HighChol' 'Education']
Code
# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(chi2_results['Feature'], chi2_results['Chi2 Score'], color="seagreen")
plt.xlabel('Chi2 Score')
plt.title('Chi-Square Scores of Features')
plt.gca().invert_yaxis()  # To display highest score at the top
plt.show()
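
One point to keep in mind when reading these scores: scikit-learn's `chi2` works on the per-class sums of each non-negative feature, so the score grows with the feature's numeric scale. The sketch below, on hypothetical data, scores the same information twice, once as a 0 to 35 count and once divided by 30. This may partly explain why wide-range columns such as PhysHlth score far above the binary columns.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical data: a day-count feature that is somewhat higher for class 1.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
days = rng.integers(0, 31, size=1000) + 5 * y

# Same information in two units: raw days, and days divided by 30.
X = np.column_stack([days, days / 30.0])

scores, p_values = chi2(X, y)
print("Score on raw days:     ", round(scores[0], 2))
print("Score on rescaled days:", round(scores[1], 2))
print("Ratio:", round(scores[0] / scores[1], 2))
```

The ratio of the two scores equals the rescaling factor, so comparing chi-square scores across features with different ranges is most meaningful after putting them on a common non-negative scale (for example with `MinMaxScaler`).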

Train and Evaluate KNN Model on Selected Features

Code
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Keep only the selected features
X_smote_selected = new_df[selected_features]  # Use only top features

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_smote_selected, y_resampled, test_size=0.3, random_state=100, stratify=y_resampled
)

print("Train/Test Split Completed!")
Train/Test Split Completed!
Code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve

# Standardize all selected feature columns (the scaler is fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN model
knn_model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_model.fit(X_train_scaled, y_train)
KNeighborsClassifier()
Code
# Make predictions
y_pred = knn_model.predict(X_test_scaled)
y_test_proba_knn = knn_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_test_proba_knn)

print(f"KNN (Feature Selection) - Accuracy: {accuracy:.4f}")
KNN (Feature Selection) - Accuracy: 0.7546
Code
print(f"KNN (Feature Selection) - ROC-AUC Score: {roc_auc:.4f}")
KNN (Feature Selection) - ROC-AUC Score: 0.8261
Code
print("\nClassification Report:")

Classification Report:
Code
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.80      0.68      0.73     58314
           1       0.72      0.83      0.77     58313

    accuracy                           0.75    116627
   macro avg       0.76      0.75      0.75    116627
weighted avg       0.76      0.75      0.75    116627
Code

# Store ROC values for later comparison
fpr_knn, tpr_knn, _ = roc_curve(y_test, y_test_proba_knn)
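
All KNN models above use k = 5. Since the literature review notes that the choice of k affects performance, another way to improve KNN is to tune k by cross-validation. The sketch below does this on hypothetical synthetic data with `GridSearchCV`. The candidate values and the ROC-AUC scoring choice are assumptions for illustration, not settings from the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the survey features.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

# Scaling sits inside the pipeline so it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 11, 21]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(f"Best k: {best_k}, mean cross-validated ROC-AUC: {search.best_score_:.3f}")
```

On a dataset of this study's size, a search like this can be slow for KNN, so running it on a stratified subsample first may be more practical.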

Logistic Regression (With SMOTE)

Code
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Define features and target
X = new_df.drop(columns=['Diabetes_binary'])
y = new_df['Diabetes_binary']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100, stratify=y)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression
log_reg = LogisticRegression(max_iter=500, solver='liblinear')
log_reg.fit(X_train_scaled, y_train)
LogisticRegression(max_iter=500, solver='liblinear')
Code
# Make predictions
y_pred = log_reg.predict(X_test_scaled)
y_test_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

# Evaluate model
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
Logistic Regression Accuracy: 0.725269448755434
Code
print("ROC-AUC Score:", roc_auc_score(y_test, y_test_proba))
ROC-AUC Score: 0.8021830227534794
Code
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.71      0.72     58314
           1       0.72      0.74      0.73     58313

    accuracy                           0.73    116627
   macro avg       0.73      0.73      0.73    116627
weighted avg       0.73      0.73      0.73    116627

Decision Tree (Before SMOTE)

Code
from sklearn.tree import DecisionTreeClassifier

# Try on `cdc_data_df` (before SMOTE)
X = cdc_data_df.drop(columns=['Diabetes_binary'])
y = cdc_data_df['Diabetes_binary']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100, stratify=y)

# Train Decision Tree
dt_model = DecisionTreeClassifier(max_depth=10, random_state=100)
dt_model.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, random_state=100)
Code
# Make predictions
y_pred = dt_model.predict(X_test)
y_test_proba = dt_model.predict_proba(X_test)[:, 1]

# Evaluate model
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))
Decision Tree Accuracy: 0.8486411109335735
Code
print("ROC-AUC Score:", roc_auc_score(y_test, y_test_proba))
ROC-AUC Score: 0.7941144359887649
Code
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.97      0.92     58314
           1       0.52      0.15      0.24     10529

    accuracy                           0.85     68843
   macro avg       0.69      0.56      0.58     68843
weighted avg       0.81      0.85      0.81     68843

Random Forest (Before SMOTE, Using cdc_data_df)

Code
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=200, max_depth=15, random_state=100, n_jobs=-1)
rf_model.fit(X_train, y_train)
RandomForestClassifier(max_depth=15, n_estimators=200, n_jobs=-1,
                       random_state=100)
Code
# Make predictions
y_pred = rf_model.predict(X_test)
y_test_proba = rf_model.predict_proba(X_test)[:, 1]

# Evaluate model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
Random Forest Accuracy: 0.8538558749618698
Code
print("ROC-AUC Score:", roc_auc_score(y_test, y_test_proba))
ROC-AUC Score: 0.812616273221423
Code
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.98      0.92     58314
           1       0.60      0.13      0.22     10529

    accuracy                           0.85     68843
   macro avg       0.73      0.56      0.57     68843
weighted avg       0.82      0.85      0.81     68843

Code Implementation Comparison


Code
library(kableExtra)

# Create DataFrame
# The Decision Tree was fit only on the pre-SMOTE data above, so it has a single row
model_metrics <- data.frame(
  Model = c("KNN (Before SMOTE)", "KNN (After SMOTE)", "KNN (Feature Selection)", 
            "Logistic Regression", "Decision Tree (Before SMOTE)", "Random Forest (Before SMOTE)"),
  Applied_SMOTE = c("No", "Yes", "Yes", "Yes", "No", "No"),  
  Accuracy = c(0.8325, 0.7619, 0.7546, 0.7253, 0.8486, 0.8539),
  ROC_AUC = c(0.7046, 0.8337, 0.8261, 0.8022, 0.7941, 0.8126),
  Precision_0 = c(0.87, 0.81, 0.80, 0.73, 0.86, 0.86),
  Recall_0 = c(0.95, 0.68, 0.68, 0.71, 0.97, 0.98),
  Precision_1 = c(0.41, 0.73, 0.72, 0.72, 0.52, 0.60),
  Recall_1 = c(0.21, 0.84, 0.83, 0.74, 0.15, 0.13)
)

# Render the table properly
kable(model_metrics, format = "html", caption = "Model Performance Comparison") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Model Performance Comparison
Model                         Applied_SMOTE  Accuracy  ROC_AUC  Precision_0  Recall_0  Precision_1  Recall_1
KNN (Before SMOTE)            No               0.8325   0.7046         0.87      0.95         0.41      0.21
KNN (After SMOTE)             Yes              0.7619   0.8337         0.81      0.68         0.73      0.84
KNN (Feature Selection)       Yes              0.7546   0.8261         0.80      0.68         0.72      0.83
Logistic Regression           Yes              0.7253   0.8022         0.73      0.71         0.72      0.74
Decision Tree (Before SMOTE)  No               0.8486   0.7941         0.86      0.97         0.52      0.15
Random Forest (Before SMOTE)  No               0.8539   0.8126         0.86      0.98         0.60      0.13
Code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# Creating a DataFrame to compare model results
model_comparison = pd.DataFrame({
    "Model": ["KNN (Before SMOTE)", "KNN (After SMOTE)", "KNN (Feature Selection)", 
              "Logistic Regression (SMOTE)", "Decision Tree (Before SMOTE)", "Random Forest (Before SMOTE)"],
    "Accuracy": [0.8325, 0.7619, 0.7546, 0.7253, 0.8486, 0.8539],
    "ROC-AUC Score": [0.7046, 0.8337, 0.8261, 0.8022, 0.7941, 0.8126],
    "Applied SMOTE": ["No", "Yes", "Yes (Feature Selection)", "Yes", "No", "No"]
})

# Plot ROC Curves
plt.figure(figsize=(8, 6))
<Figure size 800x600 with 0 Axes>
Code
# Example ROC curve values (assuming we have stored FPR, TPR for each model)
roc_curves = {
    "KNN (Before SMOTE)": ([0.0, 0.2, 0.5, 1.0], [0.0, 0.4, 0.7, 1.0], 0.7046),
    "KNN (After SMOTE)": ([0.0, 0.1, 0.6, 1.0], [0.0, 0.5, 0.8, 1.0], 0.8337),
    "KNN (Feature Selection)": ([0.0, 0.15, 0.55, 1.0], [0.0, 0.45, 0.75, 1.0], 0.8261),
    "Logistic Regression (SMOTE)": ([0.0, 0.1, 0.55, 1.0], [0.0, 0.48, 0.78, 1.0], 0.8022),
    "Decision Tree (Before & After SMOTE)": ([0.0, 0.2, 0.5, 1.0], [0.0, 0.4, 0.7, 1.0], 0.7941),
    "Random Forest (Before SMOTE)": ([0.0, 0.1, 0.6, 1.0], [0.0, 0.5, 0.8, 1.0], 0.8126)
}

for model, (fpr, tpr, auc) in roc_curves.items():
    plt.plot(fpr, tpr, label=f"{model} (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves for Different Models")
plt.legend()
plt.show()
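The plot above draws each curve from a handful of example points. As a minimal sketch of how the false positive and true positive rates could instead be derived from a model's predicted scores, the snippet below sweeps the decision threshold over the sorted scores and integrates the curve with the trapezoidal rule. The helper names `roc_points` and `trapezoid_auc` and the four toy scores are assumptions made for illustration; in practice `sklearn.metrics.roc_curve`, which the code above already imports, serves the same purpose.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists from sweeping the threshold over descending scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Hypothetical toy labels and predicted probabilities (NOT the study data)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, y_score)
auc = trapezoid_auc(fpr, tpr)
print(fpr, tpr, auc)
```

The resulting `fpr` and `tpr` lists can be passed to `plt.plot` in place of the hand-entered coordinates, so that the plotted curve and the AUC in the legend come from the same predictions.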

Conclusion

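  • kNN trained on the original data reached an accuracy of 0.8325 but a class 1 recall of only 0.21. After SMOTE, its class 1 recall rose to 0.84 and its ROC-AUC to 0.8337, the highest in the comparison, while accuracy fell to 0.7620.

  • The decision tree and random forest posted the highest accuracies (0.8486 and 0.8539) but the lowest class 1 recall (0.15 and 0.13), so accuracy alone is a misleading yardstick on this imbalanced dataset. Where missing class 1 cases is the costlier error, the kNN models trained with SMOTE appear to be the more useful choice.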

References

Ali, Ameer, Mohammed Alrubei, LF Mohammed Hassan, M Al-Ja’afari, and Saif Abdulwahed. 2020. “Diabetes Classification Based on KNN.” IIUM Engineering Journal 21 (1): 175–81.
Altamimi, Abdulaziz, Aisha Ahmed Alarfaj, Muhammad Umer, Ebtisam Abdullah Alabdulqader, Shtwai Alsubai, Tai-hoon Kim, and Imran Ashraf. 2024. “An Automated Approach to Predict Diabetic Patients Using KNN Imputation and Effective Data Mining Techniques.” BMC Medical Research Methodology 24 (1): 221.
Deng, Zhenyun, Xiaoshu Zhu, Debo Cheng, Ming Zong, and Shichao Zhang. 2016. “Efficient kNN Classification Algorithm for Big Data.” Neurocomputing 195: 143–48.
Iparraguirre-Villanueva, Orlando, Karina Espinola-Linares, Rosalynn Ornella Flores Castañeda, and Michael Cabanillas-Carbonell. 2023. “Application of Machine Learning Models for Early Detection and Accurate Classification of Type 2 Diabetes.” Diagnostics 13 (14): 2383.
Kataria, Aman, and MD Singh. 2013. “A Review of Data Classification Using k-Nearest Neighbour Algorithm.” International Journal of Emerging Technology and Advanced Engineering 3 (6): 354–60.
Khateeb, Nida, and Muhammad Usman. 2017. “Efficient Heart Disease Prediction System Using k-Nearest Neighbor Classification Technique.” In Proceedings of the International Conference on Big Data and Internet of Thing, 21–26.
Mucherino, Antonio, Petraq J Papajorgji, and Panos M Pardalos. 2009. “K-Nearest Neighbor Classification.” Data Mining in Agriculture, 83–106.
Panwar, Madhuri, Amit Acharyya, Rishad A Shafik, and Dwaipayan Biswas. 2016. “K-Nearest Neighbor Based Methodology for Accurate Diagnosis of Diabetes Mellitus.” In 2016 Sixth International Symposium on Embedded Computing and System Design (ISED), 132–36. IEEE.
Saxena, Krati, Zubair Khan, and Shefali Singh. 2014. “Diagnosis of Diabetes Mellitus Using k Nearest Neighbor Algorithm.” International Journal of Computer Science Trends and Technology (IJCST) 2 (4): 36–43.
Suriya, S, and J Joanish Muthu. 2023. “Type 2 Diabetes Prediction Using k-Nearest Neighbor Algorithm.” Journal of Trends in Computer Science and Smart Technology 5 (2): 190–205.
Syriopoulos, Panos K, Nektarios G Kalampalikis, Sotiris B Kotsiantis, and Michael N Vrahatis. 2023. “kNN Classification: A Review.” Annals of Mathematics and Artificial Intelligence, 1–33.
Theerthagiri, Prasannavenkatesan, A Usha Ruby, and J Vidya. 2022. “Diagnosis and Classification of the Diabetes Using Machine Learning Algorithms.” SN Computer Science 4 (1): 72.
Uddin, Shahadat, Ibtisham Haque, Haohui Lu, Mohammad Ali Moni, and Ergun Gide. 2022. “Comparative Performance Analysis of k-Nearest Neighbour (KNN) Algorithm and Its Different Variants for Disease Prediction.” Scientific Reports 12 (1): 6256.
Zhang, Shichao, Xuelong Li, Ming Zong, Xiaofeng Zhu, and Ruili Wang. 2017. “Efficient kNN Classification with Different Numbers of Nearest Neighbors.” IEEE Transactions on Neural Networks and Learning Systems 29 (5): 1774–85.
Zhang, Zhongheng. 2016. “Introduction to Machine Learning: K-Nearest Neighbors.” Annals of Translational Medicine 4 (11).